

A Lightweight Pipeline for Noisy Speech Voice Cloning and Accurate Lip Sync Synthesis

Amir, Javeria, Attaria, Farwa, Jabeen, Mah, Noor, Umara, Rashid, Zahid

arXiv.org Artificial Intelligence

Corresponding Author: Umara Noor

Abstract

Recent developments in voice cloning and talking-head generation demonstrate impressive capabilities in synthesizing natural speech and realistic lip synchronization. Current methods are typically trained on large-scale datasets through computationally intensive processes and rely on clean, studio-recorded inputs, which makes them infeasible in noisy or low-resource environments. In this paper, we introduce a new modular pipeline comprising Tortoise text-to-speech, a transformer-based latent diffusion model that can perform high-fidelity zero-shot voice cloning given only a few reference samples, and Wav2Lip, a lightweight generative adversarial network architecture for robust real-time lip synchronization. The pipeline addresses several essential needs: reduced reliance on massive pretraining, generation of emotionally expressive speech, and reliable lip sync in noisy, unconstrained scenarios. In addition, its modular structure allows easy extension toward future multimodal and text-guided voice modulation, and it can be integrated into real-world systems. Our experimental results show that the proposed system produces competitive audio quality and lip sync at a much smaller computational cost, indicating its potential for deployment in resource-constrained scenarios.

Keywords

Zero-Shot Voice Cloning, Latent Diffusion Models, Real-Time Lip Synchronization, GAN-Based Talking-Head Generation, Low-Resource Speech Synthesis, Emotionally Expressive Speech

1. Introduction

Voice cloning and talking-head generation systems have made tremendous progress in the past few years, benefiting from advances in deep generative models. Such systems can be employed in virtual assistants, entertainment, telepresence, and assistive communication, making human-computer interaction more realistic and personalized through interactive audio-visual context. Despite these advancements, state-of-the-art solutions rely heavily on large datasets and substantial computational resources, and may therefore be impractical in real-world low-resource or noisy settings.
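To make the described pipeline concrete, here is a minimal sketch of how the two components could be chained using the public tortoise-tts and Wav2Lip repositories. The voice name, file paths, and the "fast" preset are illustrative assumptions, not details taken from the paper:

```python
# Minimal sketch of a Tortoise + Wav2Lip pipeline (illustrative; assumes the
# neonbjb/tortoise-tts package and the Rudrabha/Wav2Lip repo are installed).
import subprocess
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

# 1) Zero-shot voice cloning: condition Tortoise on a few reference clips
#    placed under tortoise/voices/<voice_name>/ (the name is hypothetical).
tts = TextToSpeech()
voice_samples, conditioning_latents = load_voice("my_voice")
speech = tts.tts_with_preset(
    "Hello, this is a cloned voice speaking.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="fast",  # trades some quality for speed; other presets exist
)
torchaudio.save("cloned.wav", speech.squeeze(0).cpu(), 24000)  # Tortoise outputs 24 kHz

# 2) Lip synchronization: drive a face video with the generated audio using
#    Wav2Lip's standard inference script and pretrained GAN checkpoint.
subprocess.run([
    "python", "Wav2Lip/inference.py",
    "--checkpoint_path", "Wav2Lip/checkpoints/wav2lip_gan.pth",
    "--face", "speaker.mp4",
    "--audio", "cloned.wav",
], check=True)
```

In Tortoise, the preset mainly controls the number of autoregressive candidates and diffusion iterations, which is the central quality/latency trade-off a resource-constrained deployment would tune.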


Phone scammers are using faked AI voices. Here's how to protect yourself

PCWorld

Never before has it been easier to clone a human voice. New AI tools can take a voice sample, process it, copy it, and say anything in the voice of the original. Voice cloning has been possible since as early as 2018, but modern tools can do it faster, more accurately, and with greater ease. OpenAI, the artificial intelligence company behind ChatGPT, presented a project this year that showed how it's possible to clone a voice with nothing more than a 15-second recording. OpenAI's tool isn't yet publicly available, and it's said to have security measures in place to prevent misuse.


I cloned my voice using AI and the results were terrifying… DailyMail.com tries app that replicated President Joe Biden's speech to scam voters in New Hampshire

Daily Mail - Science & tech

It captured everything from the way I tend to 'Umm' and 'Aah' between words to the way I raise my voice when asking a question. New Hampshire residents received a strange call telling them to skip the primary election, and while it sounded like Joe Biden on the other end, it was an AI clone of his voice. An anonymous fraudster used the app Eleven Labs to replicate Biden's voice for the attack last month, so I tested the app to see just how believable an AI-cloned voice is. The AI-generated voice tricked a friend into thinking a message was truly from me. 'Why did you send me a voice note?' my friend replied to the message. 'You normally just email - but it's nice to hear from you!' My father also admitted the fake voice would have fooled him, and when my wife heard a short message she said, 'Oh my God, I want to throw it off a bridge.'


StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis

Zhang, Yu, Huang, Rongjie, Li, Ruiqi, He, JinZheng, Xia, Yan, Chen, Feiyang, Duan, Xinyu, Huai, Baoxing, Zhao, Zhou

arXiv.org Artificial Intelligence

Style transfer for out-of-domain (OOD) singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles (such as timbre, emotion, pronunciation, and articulation skills) derived from reference singing voice samples. However, modeling the intricate nuances of singing voice styles is an arduous task, as singing voices possess a remarkable degree of expressiveness. Moreover, existing SVS methods suffer a decline in the quality of synthesized singing voices in OOD scenarios, as they rest on the assumption that the target vocal attributes are discernible during the training phase. To overcome these challenges, we propose StyleSinger, the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples. StyleSinger incorporates two critical approaches for enhanced effectiveness: 1) the Residual Style Adaptor (RSA), which employs a residual quantization module to capture diverse style characteristics in singing voices, and 2) Uncertainty Modeling Layer Normalization (UMLN), which perturbs the style attributes within the content representation during the training phase and thus improves model generalization. Our extensive evaluations of zero-shot style transfer establish that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples. Singing voice samples are available at https://stylesinger.github.io/.
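The UMLN idea described above, perturbing style statistics inside a normalization layer during training so the model cannot overfit to seen styles, can be sketched as a small PyTorch module. This is a hedged illustration of the general technique, not StyleSinger's actual implementation; the noise-injection scheme, dimensions, and hyperparameters are assumptions:

```python
import torch
import torch.nn as nn

class UncertaintyStyleLayerNorm(nn.Module):
    """Illustrative layer norm whose scale and shift are predicted from a
    style embedding and jittered with Gaussian noise during training."""

    def __init__(self, hidden_dim: int, style_dim: int,
                 noise_scale: float = 0.1, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.noise_scale = noise_scale          # assumed hyperparameter
        self.to_gamma = nn.Linear(style_dim, hidden_dim)  # style -> scale
        self.to_beta = nn.Linear(style_dim, hidden_dim)   # style -> shift

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden), style: (batch, style_dim)
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)

        gamma = self.to_gamma(style).unsqueeze(1)  # (batch, 1, hidden)
        beta = self.to_beta(style).unsqueeze(1)

        if self.training:
            # Model style uncertainty: perturb the predicted statistics so
            # the content representation generalizes beyond seen styles.
            gamma = gamma * (1.0 + self.noise_scale * torch.randn_like(gamma))
            beta = beta + self.noise_scale * torch.randn_like(beta)

        return gamma * x_norm + beta
```

At inference time the module reduces to an ordinary style-conditioned layer norm, since the perturbation is applied only while `self.training` is true.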


FinBTech: Blockchain-Based Video and Voice Authentication System for Enhanced Security in Financial Transactions Utilizing FaceNet512 and Gaussian Mixture Models

Laila, Prof N. Jeenath, Tamilpavai, Dr G.

arXiv.org Artificial Intelligence

In the digital age, it is crucial to ensure that financial transactions are as secure and reliable as possible. This paper presents a method that combines smart contracts, blockchain technology, FaceNet512 for improved face recognition, and Gaussian Mixture Models (GMMs) for voice authentication to create a system for video and audio verification. Smart contracts and the immutable ledger of the blockchain are combined to offer a safe and open environment for financial transactions, while FaceNet512 and GMMs provide multi-factor biometric authentication, raising security to new heights. By combining these technologies, the system offers a strong defense against identity theft and unauthorized access, establishing a new benchmark for secure financial transactions.
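The GMM side of such a verification scheme follows a standard speaker-verification recipe: fit a Gaussian mixture to MFCC features from enrollment audio, then accept or reject a test utterance by its average log-likelihood. A minimal sketch using librosa and scikit-learn follows; the feature settings, mixture size, file names, and threshold are illustrative assumptions, not the paper's configuration:

```python
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_features(path: str) -> np.ndarray:
    """Load audio and return per-frame MFCC vectors (frames x coefficients)."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return mfcc.T

# Enrollment: fit a GMM to the claimed speaker's voice.
enroll = mfcc_features("enroll_user.wav")
gmm = GaussianMixture(n_components=16, covariance_type="diag", max_iter=200)
gmm.fit(enroll)

# Verification: mean per-frame log-likelihood of the test utterance.
test = mfcc_features("transaction_attempt.wav")
score = gmm.score(test)

THRESHOLD = -45.0  # illustrative; would be calibrated on genuine/impostor trials
print("accepted" if score > THRESHOLD else "rejected")
```

In practice the acceptance threshold is calibrated against held-out genuine and impostor recordings, often with a universal background model as a likelihood-ratio baseline.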


AI voices are hard to spot even if you know audio might be a deepfake

New Scientist

Could you tell if you were listening to an AI-generated voice? Even when people know they may be listening to AI-generated speech, it is still difficult for both English and Mandarin speakers to reliably detect a deepfake voice. That means billions of people who understand the world's most spoken languages are potentially at risk when exposed to deepfake scams or misinformation. Kimberly Mai at University College London and her colleagues challenged more than 500 people to identify speech deepfakes among multiple audio clips. Some clips contained the authentic voice of a female speaker reading generic sentences in either English or Mandarin, while others were deepfakes created by generative AIs trained on female voices.


AI Tom Hanks didn't offer me a job, but it sure sounds like he did

PCWorld

Tom Hanks didn't call me to pitch me a part, but it sure sounds like he did. Ever since PCWorld began covering the rise of various AI applications like AI art, I've been poking around in code repositories on GitHub and links within Reddit, where people post tweaks to their own AI models for various approaches. Some of these models actually end up on commercial sites, which either roll their own algorithms or adapt others that have been published as open source. A great example of an existing AI audio site is Uberduck.ai: enter text in the text field and you can have a virtual Elon Musk, Bill Gates, Peggy Hill, Daffy Duck, Alex Trebek, Beavis, The Joker, or even Siri read out your pre-programmed lines.


How AI Is Revolutionizing The Ways We Can Detect Mental Illness

#artificialintelligence

Predictive AI applications are relatively new to mental and behavioral health, but are already showing a lot of promise. In a recent publication on detecting suicide risk by analyzing text messages, UW Medicine researchers found that algorithms performed as well as trained evaluators. This is great news for predictive AI and for the ability to save lives at risk of suicide through real-time data analysis, when and where the individual is located. This matters because some healthcare providers worry that when they communicate with a patient by text message, they might miss something they are trained to pick up from voice inflection, facial expression, and other auditory or physical signals. Algorithms like this can help enhance a provider's ability to assess the patient when communicating by text, an increasingly popular way for people to access mental health care.


Artificial intelligence can tell if you've got heart problems simply by the sound of your voice

#artificialintelligence

Advanced AI technology can now calculate a person's risk of suffering a heart attack just by listening to their voice, a new study says. Researchers at the Mayo Clinic say people who have a high voice biomarker score are more than twice as likely to suffer major heart problems related to clogged arteries. Coronary heart disease is the most common cause of heart attack and is one of the leading causes of death for both men and women worldwide. It occurs when the heart's blood supply is blocked or interrupted by a build-up of plaque in the arteries. Now, scientists have developed a "powerful screening tool" dubbed the Vocalis Health algorithm to help identify those who are particularly at risk and may need close monitoring.

